Description

Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

The Thera bank recently saw a steep decline in the number of users of its credit card. Customers leaving the credit card service lead to losses for the bank, so the bank wants to analyze its customer data to identify the customers who are likely to leave its credit card services, and the reasons why, so that it can improve in those areas.

As a data scientist at Thera bank, I would like to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards. Thus, I need to identify the best possible model that delivers the required performance.

Objectives

1. Explore and visualize the dataset.
2. Build a classification model to predict whether a customer is going to churn.
3. Optimize the model using appropriate techniques.
4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

CLIENTNUM: Client number. Unique identifier for the customer holding the account

Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"

Customer_Age: Age in Years

Gender: Gender of the account holder

Dependent_count: Number of dependents

Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate.

Marital_Status: Marital Status of the account holder

Income_Category: Annual Income Category of the account holder

Card_Category: Type of Card

Months_on_book: Period of relationship with the bank

Total_Relationship_Count: Total no. of products held by the customer

Months_Inactive_12_mon: No. of months inactive in the last 12 months

Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months

Credit_Limit: Credit Limit on the Credit Card

Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance

Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter

Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter

Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Read Dataset
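The actual file name is not shown in this report; as a minimal, self-contained sketch, reading the data with pandas might look like the following (using an in-memory CSV with a few of the columns from the data dictionary instead of the real file, which would be loaded with something like `pd.read_csv("BankChurners.csv")` - that file name is an assumption):

```python
from io import StringIO
import pandas as pd

# Tiny in-memory stand-in for the real CSV, with a few data-dictionary columns.
sample_csv = StringIO(
    "CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Credit_Limit\n"
    "768805383,Existing Customer,45,M,12691.0\n"
    "818770008,Attrited Customer,49,F,8256.0\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)   # (2, 5)
print(df.dtypes)  # pandas infers numeric vs. object columns automatically
```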

The only continuous variables are: Customer_Age, Months_on_book, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio

Education_Level and Marital_Status have missing values

There are more existing customers than attrited customers in the dataset. Female, graduate, married, and Blue-card customers are the most common groups, and most account holders earn less than 40K dollars annually.

Let's carry out some univariate data analysis
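As a sketch of what such a univariate pass looks like (on a tiny synthetic stand-in for the churn data, not the real file), one would typically look at frequency tables for categoricals and summary statistics for numerics:

```python
import pandas as pd

# Synthetic stand-in: column names from the data dictionary, made-up values.
df = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer"] * 8 + ["Attrited Customer"] * 2,
    "Customer_Age": [26, 35, 44, 51, 39, 47, 62, 33, 41, 58],
})

# Univariate views: class proportions for a categorical, summary stats for a numeric.
print(df["Attrition_Flag"].value_counts(normalize=True))
print(df["Customer_Age"].describe())
```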

Bivariate Analysis

Let's see if any of the variables have a strong positive or negative correlation with Attrition_Flag
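One simple way to check this, sketched here on synthetic data (the values are invented for illustration), is to encode Attrition_Flag as 0/1 and correlate each numeric column with it:

```python
import pandas as pd

# Made-up rows: attrited customers given lower transaction counts and balances.
df = pd.DataFrame({
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Existing Customer",
                       "Attrited Customer", "Existing Customer", "Existing Customer"],
    "Total_Trans_Ct": [38, 80, 72, 41, 95, 66],
    "Total_Revolving_Bal": [0, 1500, 1200, 200, 1800, 900],
})

# Encode churn as 1, then correlate each numeric column with it.
churn = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
corr = df.drop(columns="Attrition_Flag").corrwith(churn)
print(corr)  # negative values: churners transact less in this toy data
```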

None of the variables have a strong relationship individually with the Attrition Flag.

The table supports the observations from the boxplots above.

Now let's do a final check on the data before proceeding to data preprocessing/feature engineering and model building

-There are missing values in Education_Level.

-There are missing values in Marital_Status.

These observations account for about 10% of the data, so we will not treat them as outliers.

The extreme values of Total_Amt_Chng_Q4_Q1 have attributes similar to the rest of the data, so we won't treat them as outliers in this model.

The extreme values of Total_Trans_Amt have attributes similar to the rest of the data, so we won't treat them as outliers in this model.

The two observations are not very similar, especially in their Age, Gender, Income_Category, Credit_Limit, and Avg_Utilization_Ratio. So they are probably not outliers; they just have unique attributes.

The extreme values of Total_Ct_Chng_Q4_Q1 have attributes similar to the rest of the data, so we won't treat them as outliers in this model.

Using the most-frequent strategy to fill missing categorical values
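With scikit-learn, the most-frequent strategy is available through SimpleImputer; a small sketch on made-up data (np.nan marks the missing entries):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with missing values in the two affected columns.
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Graduate", np.nan, "High School", np.nan],
    "Marital_Status": ["Married", np.nan, "Single", "Married", "Married"],
})

# Replace each missing value with that column's mode.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)  # NaNs become "Graduate" and "Married" respectively
```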

Building the model

Model evaluation criterion:

The model can make wrong predictions in two ways:

  1. Predicting a customer will remain and the customer doesn't - Loss of resources
  2. Predicting a customer will not remain and the customer remains - Loss of opportunity

Which case is more important?

How can we reduce this loss, i.e., how do we reduce false negatives?

Which case is also important?

How can we reduce this loss, i.e., how do we reduce false positives?
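For concreteness, recall and precision are the metrics that quantify these two error types; a toy example with scikit-learn (the labels here are invented):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = customer attrites, 0 = customer stays.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one false negative, one false positive

# Recall = TP / (TP + FN): the share of actual churners we caught.
print(recall_score(y_true, y_pred))     # 0.75
# Precision = TP / (TP + FP): the share of predicted churners who actually churn.
print(precision_score(y_true, y_pred))  # 0.75
```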

Let's start by building different models using KFold and cross_val_score, and then tune the best models using GridSearchCV and RandomizedSearchCV
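A minimal sketch of that cross-validation loop on synthetic data (the feature matrix, class balance, and scoring choice here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared churn features/target (imbalanced classes).
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.84, 0.16], random_state=42)

# 5-fold CV scored on recall, since false negatives are the costlier error here.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(GradientBoostingClassifier(random_state=1), X, y,
                         cv=kfold, scoring="recall")
print(scores.mean())
```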

Oversampling train data using SMOTE

Let's train the model using the oversampled data

Undersampling train data using Random Undersampler

Let's train the model using the undersampled data

Hyperparameter tuning will be performed on all the models (with undersampled data) except the Logistic Regression model, which performed relatively poorly.

Hyperparameter Tuning

The undersampled data gave better performance, so those are the models that will be tuned to see if they improve further.
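A hedged sketch of such tuning with GridSearchCV (the grid values below are illustrative, not the ranges actually searched, and the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the undersampled training data.
X, y = make_classification(n_samples=200, random_state=0)

# Hypothetical grid; the real tuned ranges are not shown in this report.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.1, 0.5]}
search = GridSearchCV(GradientBoostingClassifier(random_state=1),
                      param_grid, cv=3, scoring="recall")
search.fit(X, y)
print(search.best_params_)  # best combination found over the 3-fold recall scores
```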

AdaBoost

Both recall and precision are good.

XGB Classifier

Recall is high for both training and validation; however, precision is too low, so there will be many false positives.

GradientBoosting

The Gradient Boosting Classifier outperforms the AdaBoost model on all metrics.

Random Forest

High recall, but low precision for Random Forest.

Bagging

The Bagging Classifier performed poorly on both recall and precision.

Decision Tree

The recall score is good, but precision is low for the Decision Tree Model

Comparing all models

The tuned models perform well on the training set, except the Bagging Classifier.

The tuned models perform well on the validation set, except the Bagging Classifier. The tuned Gradient Boosting Classifier on the undersampled dataset performed best on accuracy, recall, precision, and F1 scores, followed by the AdaBoost Classifier.


The Gradient Boosting Classifier performed better on accuracy, precision, and F1 scores on the test data as well. So we can be fairly confident of low false negatives (recall score over 97%), balanced prediction (F1 score over 84%), and strong overall accuracy (over 94%).

We used undersampled data to make sure the observations are a subset of the original dataset with an equal (50:50) distribution of the Attrition Flag. Hence all the metrics are meaningful, including accuracy, since we now have a balanced dataset.

With the Gradient Boosting model we get high accuracy, recall, precision, and F1 scores; thus, it is the best model.

Let's look at the feature importances of the Gradient Boosting model we developed

The most important features are Total_Trans_Ct, Total_Trans_Amt, Total_Revolving_Bal, Total_Ct_Chng_Q4_Q1, Total_Amt_Chng_Q4_Q1, and Customer_Age
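A sketch of how such importances are extracted from a fitted gradient boosting model (synthetic data; the feature names are borrowed from the report's list purely for illustration and do not reflect the real ranking):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names over synthetic data.
names = ["Total_Trans_Ct", "Total_Trans_Amt", "Total_Revolving_Bal", "Customer_Age"]
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# feature_importances_ sums to 1; sorting gives the ranking to plot or tabulate.
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```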

Pipelines for productionizing the model
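A minimal sketch of such a pipeline (the real one presumably mirrors the preprocessing above: most-frequent imputation and encoding of categoricals, then the tuned model; the tiny dataset here is made up):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with one categorical (with a missing value) and one numeric.
df = pd.DataFrame({
    "Marital_Status": ["Married", "Single", np.nan, "Married", "Single", "Married"],
    "Customer_Age": [45, 30, 52, 41, 38, 60],
})
y = [0, 1, 0, 1, 0, 1]

# Impute + one-hot encode the categorical; pass the numeric through unchanged.
preprocess = ColumnTransformer([
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     ["Marital_Status"]),
    ("num", "passthrough", ["Customer_Age"]),
])

pipe = Pipeline([("prep", preprocess),
                 ("model", GradientBoostingClassifier(random_state=1))])
pipe.fit(df, y)
print(pipe.predict(df))
```

Bundling preprocessing and model into one object means the exact same transformations are applied at scoring time, which is what makes the pipeline's metrics comparable to the earlier ones.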

The pipeline's accuracy score of over 94% matches the accuracy obtained before adopting the pipeline, so the results are consistent.

The model performance is great, with an accuracy score over 94.1%, recall over 97.2%, precision over 74.1%, and F1 over 84.1%.

Conclusion

Business Recommendation

Thank You!

Let's see how the pipeline ignores unnecessary columns
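This behaviour comes from ColumnTransformer selecting columns by name, with everything else dropped by default (remainder="drop"). A tiny demonstration (made-up data; a lightweight LogisticRegression stands in for the tuned model):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# CLIENTNUM is present in the frame but never selected by the transformer.
train = pd.DataFrame({"Customer_Age": [25, 40, 33, 58],
                      "CLIENTNUM": [1, 2, 3, 4]})
y = [0, 1, 0, 1]

pipe = Pipeline([
    ("prep", ColumnTransformer([("num", "passthrough", ["Customer_Age"])])),
    ("model", LogisticRegression()),
])
pipe.fit(train, y)

# Only the selected column reaches the estimator; CLIENTNUM is silently dropped.
print(pipe.named_steps["model"].n_features_in_)  # 1
print(pipe.predict(train))
```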